Virus Evolution
Oxford University Press (OUP)
Preprints posted in the last 30 days, ranked by how well they match Virus Evolution's content profile, based on 140 papers previously published here. The average preprint has a 0.07% match score for this journal, so anything above that is already an above-average fit.
Yan, A. W. C.; Riley, S.; McCaw, J. M.
Cell tropism, or the preference of a virus for particular cell types, has major implications for viral transmission, pathogenesis, and evolution. An increase in viral fitness (increased within-host replication, which also leads to increased transmission between hosts) can result from a virus changing its cell tropism. This is illustrated in the context of influenza, where adaptation to infect cells expressing α2,6-linked sialic acid receptors enhances human-to-human transmissibility. Target cell populations differ not only in abundance but also in intrinsic properties such as susceptibility, viral production, and interferon responses, rendering the relationship between tropism and viral fitness multi-faceted and complex. Understanding how different cell tropisms quantitatively change fitness remains an important open question in virology and quantitative biology. Here, we present a within-host mathematical model that incorporates distinct target cell types differing in key properties, and examine how cell tropism affects viral fitness, as measured by metrics such as peak viral load, infection duration, or total virus produced. Our analysis reveals that tradeoffs may arise when cell types differ by multiple characteristics. We further demonstrate that model parameters describing heterogeneity between cell types can be more accurately inferred when cell type proportions are measured alongside viral load. Our findings provide a framework for assessing the links between viral evolution, cell tropism, and within-host fitness, and motivate the design of experiments to collect quantitative data on between-cell heterogeneity.
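The kind of model described in this abstract can be illustrated with a minimal sketch. It assumes a standard target-cell-limited model extended to several cell types and integrated with a simple Euler scheme; it is not the authors' model. The function name `simulate`, the parameter names, and all numerical values are hypothetical and chosen only for illustration.

```python
def simulate(beta, p, T0, delta=1.0, c=5.0, V0=1e-2, dt=1e-3, t_end=20.0):
    """Euler integration of a target-cell-limited model with several cell types.

    beta[i]: infection rate of cell type i (0 means the virus cannot infect it)
    p[i]:    virus production rate of an infected cell of type i
    T0[i]:   initial abundance of susceptible cells of type i
    delta:   death rate of infected cells; c: virus clearance rate
    Returns (peak viral load, area under the viral load curve).
    All values are hypothetical, in arbitrary normalised units.
    """
    T = list(T0)
    I = [0.0] * len(T)
    V = V0
    peak, area = V0, 0.0
    for _ in range(int(t_end / dt)):
        infections = [b * t * V for b, t in zip(beta, T)]
        production = sum(pi * i for pi, i in zip(p, I))
        T = [t - dt * n for t, n in zip(T, infections)]
        I = [i + dt * (n - delta * i) for i, n in zip(I, infections)]
        V += dt * (production - c * V)
        peak = max(peak, V)
        area += dt * V
    return peak, area


# Two hypothetical viruses: one infects only cell type 0, the other infects both types.
narrow = simulate(beta=[10.0, 0.0], p=[5.0, 5.0], T0=[0.5, 0.5])
broad = simulate(beta=[10.0, 10.0], p=[5.0, 5.0], T0=[0.5, 0.5])
print("narrow tropism: peak=%.3f area=%.3f" % narrow)
print("broad tropism:  peak=%.3f area=%.3f" % broad)
```

Giving the cell types different `beta` and `p` values at the same time is one way to explore the tradeoffs the abstract mentions, since a more susceptible cell type need not be the more productive one.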
Turner, S. A.; Olivier, J.; Ellis, M. L.; Floyd, K. A.; Lai, L.; Scheaffer, S. M.; Hastings, I.; Darling, T. L.; Miller, B. A.; Patel, C. J.; Peck, H.; Vanover, D.; Santangelo, P. J.; Diamond, M. S.; Suthar, M. S.; Boon, A. C. M.; Smith, D. J.
BA.3.2, a variant of SARS-CoV-2 containing ~40 mutations in its spike protein compared to its nearest ancestor, has spread globally since its first detection in South Africa in November 2024. Here, we report antigenic characterization of BA.3.2 viruses in three naive animal models, and visualize its antigenic phenotype in the context of SARS-CoV-2 evolution using antigenic cartography. We find that: (1) BA.3.2 is substantially antigenically divergent from existing SARS-CoV-2 variants; (2) infection with BA.3.2 in hamster and mouse animal models produces sera with lower homologous titer than infection with other variants. Both of these results may have implications for the selection of vaccine antigens.
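As background on antigenic cartography, the sketch below shows the usual first step: converting a titer table into target distances, where each distance is the log2 of the highest titer observed for a serum minus the log2 of the titer against a given antigen. This is the standard table-distance definition and not this preprint's pipeline. The function name and the titer values are hypothetical, and handling of below-threshold titers is omitted.

```python
import math


def table_distances(titers):
    """Convert titers[antigen][serum] into antigenic target distances.

    distance = log2(max titer for that serum) - log2(titer).
    One unit corresponds to a two-fold drop in titer.
    """
    sera = {s for row in titers.values() for s in row}
    col_max = {
        s: max(math.log2(row[s]) for row in titers.values() if s in row)
        for s in sera
    }
    return {
        ag: {s: col_max[s] - math.log2(t) for s, t in row.items()}
        for ag, row in titers.items()
    }


# Hypothetical titers for two antigens against two sera.
titers = {
    "A": {"anti-A": 1280, "anti-B": 40},
    "B": {"anti-A": 80, "anti-B": 640},
}
distances = table_distances(titers)
print(distances)
```

A map is then usually obtained by placing antigens and sera in a low-dimensional space so that map distances match these target distances as closely as possible.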
Haddox, H. K.; Hinrichs, A. S.; Jennings-Shaffer, C.; Johnson, K.; Benton, C. T.; Galloway, J. G.; Bloom, J. D.; Matsen, F. A.
Influenza virus's rapid evolution is shaped by both neutral mutation and selection. Phylogenetics can be used to study these processes, but this approach has typically only been applied to a few thousand influenza genome sequences at once. Here, we built phylogenetic trees with >100,000 influenza sequences, and then used these trees to estimate neutral rates of mutations to the virus's genome. Neutral rates varied by up to ~100-fold among the 12 nucleotide mutation types (A→C, A→G, etc.). These rates were highly correlated among influenza, SARS-CoV-2, and HIV, though more nuanced context-dependent patterns showed marked differences between influenza and SARS-CoV-2. We also estimated fitness effects of mutations by comparing the number of times a mutation was observed to occur along the branches of a tree to the number of times we expect it to have occurred under neutrality. We estimated effects for ~33,000 nonsynonymous and ~8,000 synonymous mutations spanning all influenza proteins. This compendium of estimated effects helps map the relationship between sequence and fitness in a natural setting, including regions where synonymous mutations are under functional constraint, and for proteins with limited experimentally measured effects. We built interactive heatmaps of the estimated fitness effects to help readers explore these data (see https://matsen.group/flu-mut-rates). Altogether, this work places influenza's mutation rates in a broader cross-viral context and deepens our understanding of how mutation and selection shape influenza evolution in nature at a site-specific level.
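The observed-versus-expected comparison described above can be sketched as a log ratio of counts. The abstract does not give the exact estimator, so the pseudocount, the natural-log scale, and the function name `fitness_effect` are assumptions made for illustration.

```python
import math


def fitness_effect(observed, expected, pseudocount=0.5):
    """Log ratio of observed to expected mutation counts on a tree.

    observed:    times the mutation occurred along tree branches
    expected:    times it would be expected to occur under neutrality
    pseudocount: hypothetical smoothing term that keeps the ratio finite
                 when a mutation is never observed
    Negative values suggest a deleterious mutation, positive a beneficial one.
    """
    return math.log((observed + pseudocount) / (expected + pseudocount))


# Hypothetical counts for three mutations.
for name, obs, exp in [("neutral-like", 10, 10.0), ("depleted", 0, 20.0), ("enriched", 40, 10.0)]:
    print(name, round(fitness_effect(obs, exp), 3))
```

Under this sketch a mutation observed exactly as often as expected scores 0, which matches the intuition that neutral mutations accumulate at the neutral rate.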
Courcelles, M.; Tounkara, K.; Mantip, S.; Niang, M.; Kounta Sidibe, C. A.; Sery, A.; Dakouo, M.; Luka, P. D.; Adedeji, A.; Shamaki, D.; Muhammad, M.; Ali, Y. H.; Saeed, I. K.; Awuni, J.; Odoom, T.; Tetteh, P. A.; Yingar, D. T.; Wade, A.; Dickmu, S.; Diddi, A.; Shawash, H.; Couacy-Hymann, E.; Mathurin, K. Y.; Ouled Ahmed Ben Ali, H.; Ben Hassen, S.; hadouchi, s.; Alm-ajali, A.; Settypalli, T. B. K.; Lamien, C. E.; Salami, H.; Rassoul, S.; Asnaoui, M.; Cetre-Sossah, C.; Guendouz, S.; Kwiatek, O.; Libeau, G.; Dundon, W. G.; Bataille, A.
Peste des petits ruminants (PPR) is a highly contagious viral disease of small ruminants caused by the peste des petits ruminants virus (PPRV), which is classified into four distinct genetic lineages (I-IV). A critical concern in the recent epidemiological history of PPRV is the rapid and widespread expansion of lineage IV (LIV) across West Africa over the past decade. This dominance suggests a potential adaptive advantage of circulating LIV strains in the region's current epidemiological context. In this study, we obtained genome sequences from 26 new PPRV samples, including historical (pre-2000) and many recent African LIV isolates, offering the first opportunity to investigate the evolutionary history of LIV in Africa and identify genetic events potentially associated with its recent spread. Phylogenomic analyses implemented on a dataset of 167 curated PPRV genome sequences reveal that the most ancestral LIV group comprises strains circulating in Sub-Saharan Africa (designated clade LIVssa), providing robust evidence for an African origin of lineage IV. Our results further indicate that PPRV strains linked to the recent West African expansion of LIV belong to a specific LIVssa subgroup, termed NigB. We identified multiple signatures of selection pressure within the LIVssa sublineage, particularly in the NigB cluster. Several amino acid substitutions unique to LIVssa or NigB were detected, some of which may impact protein function and warrant prioritised investigation. Additional genomic data are required to confirm the association between the NigB group and the ongoing spread of LIV in West Africa. The evolutionary adaptations observed in LIVssa (potentially enhancing transmission efficiency, host range or pathogenicity) could undermine current disease control strategies in regions where PPR poses significant threats to food security and local economies.
Author Summary: Peste des petits ruminants virus (PPRV) infects sheep and goats across Africa, the Middle East, Asia and Europe, causing disease with a major impact on the global economy and food security. One genetic lineage of PPRV, called lineage IV (LIV), is at the origin of the most recent expansion of the distribution of the disease, including replacement of other lineages in areas of Africa where PPRV is historically present. Here, we generated genome sequences from PPRV LIV isolates from different dates and places to study the evolution of this genetic lineage and explore whether its recent spread can be associated with the appearance of new mutations in the virus genome. Our results provide evidence that PPRV LIV originated in Sub-Saharan Africa and identify mutations present only in virus isolates currently spreading in new regions of Africa. Further research should investigate the impact of these mutations on protein function and on the transmission capacity of PPRV.
Dee, K.; Imrie, R.; MacLean, O.; Mojsiejczuk, L.; Smith, E.; Raveendran, S.; Lamb, K.; Chen, H.; Schultz, V.; Wang, Z.; Walsh, S. K.; Zhang, J.; Hutchinson, E. K.; Willett, B. J.; Thomson, E. C.; Hughes, J. C.; Robertson, D. L.; Illingworth, C. L.; Murcia, P.
The emergence in 2025/26 of the influenza A/H3N2 K substrain (H3N2/K) was the cause of significant public health concern. This genetically divergent virus was assessed to have a strongly decreased reactivity to contemporary vaccine strains. A prolonged influenza season in the Southern Hemisphere and an early season in the Northern Hemisphere contributed to concerns about vaccine efficacy. Here we retrospectively assessed the genetic and antigenic properties of this virus, combining epidemiological surveillance data, computational antigenic analysis, and serological data using samples from a well-stratified UK cohort. In contrast to initial indications, we found that despite the genetic distinctiveness of H3N2/K the virus had undergone limited antigenic change, suggesting that its emergence was instead the result of selection for non-antigenic properties. We confirmed previous results showing that contemporary vaccines produced an enhanced neutralising response to H3N2/K but, in a stratified serological analysis, showed that responses to the J and K substrains were age-dependent, largely driven by patterns of vaccination. Our results have implications for antigenic surveillance and for public communication strategies in future influenza seasons.
Zecchin, B.; Monne, I.; Dianati, M.; Bortolami, A.; Savegnago, E.; Shkodra, E.; Revilla Fernandezd, S.; Steensels, M.; Van Borm, S.; Ivanova, E.; Roncevic, I.; Savic, V.; Nagy, A.; Hjulsager, C. K.; Thorup, C.; Larsen, L. E.; Nurmoja, I.; Kauppinen, A.; Tammiranta, N.; Briand, F.-X.; Grasland, B.; Ahrens, A. K.; Pohlmann, A.; Gunther, A.; Harder, T.; Malik, P.; Garza Cuartero, L.; Cvetkova, S.; Kibilds, J.; Steingolde, Z.; Pumputis, E.; Pileviciene, S.; Snoeck, C. J.; Bourg, M.; Groza, O.; Bellido Martin, B.; Fouchier, R.; Thewessen, S.; Vuong, O.; Ballmann, M.; Engelsma, M.; Arnason Boe, C.;
Since 2020, high pathogenicity avian influenza H5Nx viruses of clade 2.3.4.4b have become enzootic in Europe, causing recurrent epidemic waves characterized by extensive reassortment events. Here, we describe the emergence of a single high-fitness genotype (EA-2024-DI) that has driven two consecutive waves, evolving into distinct sub-lineages. While its circulation is ongoing, during the 2025-2026 wave it caused an unprecedented number of cases in wild birds. Using phylodynamic analyses of a large dataset of genomic sequences, we compared the spatial diffusion and host transmission pattern of the EA-2024-DI sub-lineages across the three most recent epidemic waves (2023-2024, 2024-2025 and 2025-2026). We show that the genotype has persisted over time and has spread primarily through wild Anseriformes, but with a marked change in the transmission patterns between the different waves and a shift in the epicenter from Eastern to Central Europe, the latter having emerged as an important hub for virus diffusion throughout Europe. Our results reveal a recent increase in the frequency of viruses from wild and domestic mammals carrying mutations enhancing virus replication in mammalian hosts, highlighting the importance of proactive monitoring of this group of hosts to better understand its role in the virus ecology and evolution.
Alrefae, T. A.; Pons-Salort, M.; Donnelly, C. A.; Lambert, B.; Kamau, E.
Abstract: Serological assays remain the standard experimental approach for estimating the cumulative incidence of a pathogen and monitoring population immunity. The predominant approach for analysing serum titration data from virus neutralisation assays uses a nearly century-old interpolation-based method which neglects inherent imperfections in the assay and produces estimates with no measure of uncertainty. We introduce a two-part Bayesian modelling framework to estimate the underlying antibody concentrations in the raw serum samples taken from serosurveyed individuals, to improve the interpretation of serological data over age. First, we develop a mechanistic Bayesian model for serum antibody titration data that estimates latent antibody concentrations while accounting for assay variability and quantifying uncertainty. Second, we propagate this uncertainty into an age-structured serocatalytic model by integrating over posterior draws of individual antibody concentrations, allowing joint inference on latent serostate membership, force of infection, and serological waning rate. We use this framework to explore the dynamics of infection and immunity for three enterovirus serotypes: enteroviruses A71 (EV-A71) and D68 (EV-D68) and coxsackievirus A6 (CVA6). These serotypes are leading causes of outbreaks of severe respiratory illness and hand, foot, and mouth disease. Applying these approaches to three cross-sectional serosurveys, we estimated consistently higher and more persistent antibody concentrations throughout life for EV-D68 compared to EV-A71 and CVA6. Our analysis suggests that the proportion of recently infected individuals (i.e. individuals with high estimated antibody concentration levels given their age) peaks around 25% by age 7 years for both EV-A71 and CVA6 before gradually declining with age. In contrast, for EV-D68 the inferred proportion of the population in the infected state exceeds 50% by age 9 years and continues to grow with age.
We also estimate that EV-D68 antibody concentration levels are higher than those of the other two serotypes, with the force of infection estimated to be highest in early childhood and declining more gradually with age than for EV-A71 and CVA6. These estimates differ from previous estimates found in the literature. Our inferential framework uncovers the wide-ranging variation in antibody levels that is often obscured by conventional endpoint titre estimation methods. We demonstrate that our framework can infer infection rates without relying on predetermined seropositivity cut-offs and without making explicit assumptions of virus-specific infection mechanisms. Author summary: Serological tests measure antibody levels in blood to show how widely a virus has spread and how well populations are protected. Titre-based tests dilute blood samples in steps, mix these dilutions with virus, and add the mixture to living cells; the titre is the highest dilution where antibodies still protect cells from infection. Traditional analyses overlook test imperfections. We present a new two-part Bayesian framework to estimate antibody levels and track age-related exposure to infection. First, we estimate underlying antibody concentrations while accounting for uncertainty, then use these estimates in another model to infer age-specific transmission of three common viruses: EV-A71, EV-D68, and CVA6. Our results show that EV-D68 infections may be more common, especially in children, compared to the other viruses. This new approach provides a clearer picture of the dynamics of seroconversion, without relying on arbitrary thresholds, helping to improve public health monitoring and responses.
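For readers unfamiliar with serocatalytic models, the sketch below gives the textbook reversible form with a constant force of infection and a constant waning rate. It is a simplification and not the authors' framework, which is age-structured, Bayesian, and works from latent antibody concentrations. The function name and the parameter values are hypothetical.

```python
import math


def seropositive_fraction(age, foi, waning=0.0):
    """Expected seropositive fraction at a given age (reversible catalytic model).

    foi:    force of infection, per year (assumed constant with age here)
    waning: rate of reversion to the seronegative state, per year
    Solves dP/da = foi * (1 - P) - waning * P with P(0) = 0.
    """
    total = foi + waning
    if total == 0:
        return 0.0
    return foi / total * (1.0 - math.exp(-total * age))


# Hypothetical rates: 0.1 infections per year, waning at 0.05 per year.
for age in (1, 5, 10, 40):
    print(age, round(seropositive_fraction(age, foi=0.1, waning=0.05), 3))
```

With waning, the curve plateaus at `foi / (foi + waning)` instead of approaching 1, which is why waning and force of infection need to be inferred jointly from age-stratified data.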
Grigson, S. R.; Geliashvili, N.; Schubert, T.; Bouras, G.; Mallawaarachchi, V.; Bogacz, M.; Hellmich, U.; Edwards, R. A.; Dutilh, B. E.
Bacteriophages (phages) play essential roles in microbial systems, yet most phage proteins remain poorly characterised. Protein tertiary and quaternary structure provides valuable information about protein function. As many phage proteins function as homooligomers, complexes that consist of multiple identical subunits, there is great interest in computationally predicting their configurations. Here we present a computational framework, the Phage Homomer Level Estimate and Generation Method (PHLEGM), for inferring homooligomeric states directly from the protein sequence by combining AlphaFold-Multimer modelling with inter-subunit interface quality assessment. We proceeded to experimentally validate two out of nine predicted homooligomers using size exclusion chromatography and complementary hydrodynamic techniques. These efforts confirmed our predictions for a dimer and a trimer, highlighting the value of experimentally benchmarked computational predictions and showing the challenges of heterologous phage protein production. Applied to >22,000 phage protein sequences in the PHROGs database, our approach revealed extensive diversity in phage homooligomeric protein complexes. Benchmarking against protein language model-based predictors on a curated reference set of known phage homooligomers demonstrated superior accuracy of our structure-based method, achieving robust performance in classifying protein homooligomeric states, with the highest accuracy observed for trimers and higher-order complexes. These results highlight the value of computational predictions to decipher the complexities of the vast viral sequence space. All predicted complex structures and functional inferences are made publicly available to support structural and functional studies of phage proteins.
Muston, P.; Triebel, S.; Nawrocki, E.; Ontiveros-Palacios, N.; Jandalala, I.; Sweeney, B.; Bateman, A.; Marz, M.; Petrov, A. I.; Madrigal, P.
Rfam is a comprehensive database of non-coding RNA (ncRNA) families providing curated sequence alignments, consensus secondary structures, and covariance models for thousands of RNA families. The database is essential for identifying structured non-coding RNAs in newly sequenced genomes and understanding RNA structure-function relationships. Here we present computational protocols for automated ncRNA annotation of viral genomes, and for programmatic interaction with Rfam through its RESTful API. We showcase genome-wide RNA structure visualization from a genome sequence and from a multiple sequence alignment by generating comprehensive 2D structure diagrams using newly developed features in R2DT. We also present practical examples for retrieving family metadata, downloading alignments, accessing secondary structures, and searching user sequences from the Rfam API. These methods enable researchers in virology and RNA biology to integrate Rfam data into custom bioinformatics pipelines, comparative analyses, and machine learning workflows.
Dewari, P. S.; Regan, T.; Chapuis, A. F.; Florea, A.; Furniss, J. J.; Clark, T. C.; Taylor, R. S.; Bean, T. P.
Background: The Pacific oyster (Crassostrea/Magallana gigas) is increasingly recognised as a model marine invertebrate. Valued for both ecological and commercial importance, Pacific oysters are farmed widely, supporting global food security by providing a sustainable nutrient-rich source of protein. Despite the significant and recurring economic losses caused by Ostreid herpesvirus (OsHV-1) outbreaks, only a limited number of studies have examined host-pathogen interplay at single-cell resolution. The few available studies largely focus on circulating immune cells (haemocytes), thereby overlooking the complexity of host responses across different tissues and organs. Results: We present a detailed single-nucleus transcriptomic atlas of the whole Pacific oyster, including during OsHV-1 infection. A total of 18 distinct transcriptomic clusters were resolved, capturing major cell populations from the gill, mantle, hepatopancreas, adductor muscle, and haemocytes. Notably, three populations (gill ciliary cells, hepatopancreas cells, and an immune-enriched cluster 1) exhibited pronounced transcriptomic responses to OsHV-1 infection. Across the 6, 24, 72, and 96 hours post-infection (hpi) time course, viral transcripts were detected almost exclusively at 72 hpi, with enrichment primarily in adductor muscle cells and two immune cell populations: immature haemocytes and hyalinocytes. Conclusions: Our findings suggest potential entry portals and tissue-specific replication sites for the OsHV-1 virus in Pacific oysters. This atlas resource provides a high-resolution cellular framework for understanding host-virus interactions and establishes a foundation for future investigations into herpesvirus pathogenesis in marine invertebrates.
Maachi, A.; Donaire, L.; Aranda, M. A.
Tomato brown rugose fruit virus (Tobamovirus fructirugosum) is an emerging virus that affects tomatoes, capsicum, and chili. Since its first detection in Jordan in 2015, the virus has been reported in more than 40 countries across all continents. In Morocco, the virus was reported for the first time in October 2021. However, its genetic diversity there remains unexplored. In this work, we used a collection of tomato fruits from local markets to investigate the variability of the virus in the country. We explored the different selection pressures acting on the N-terminus of the RNA-dependent RNA polymerase, the movement protein, and the coat protein genes. Then, we used haplotype network analyses to reveal the population structure within the Moroccan isolates and studied their relationships with isolates from the rest of the world. We found that genetic diversity is low, which is consistent with the global situation. No signatures of diversifying selection were detected across the analyzed genes. However, the virus sequences from Morocco showed a clear geographic structure, suggesting that geographic factors, probably combined with agricultural practices, may contribute to shaping the population structure of ToBRFV in Morocco.
Rodamilans, B.; Rincon Barrado, M.; Cobos, A.; Simon Mateo, C.; Valli, A. A.
The family Potyviridae represents the largest and most economically important group of plant-infecting RNA viruses. Despite extensive study of crop-associated members, the full diversity, host range, and evolutionary history of potyvirids remain poorly understood. Here, we conducted a large-scale mining of publicly available RNA-seq datasets to systematically search for novel potyvirid sequences. This approach enabled the identification and assembly of 47 previously undescribed members of the family, distributed across eight recognized genera and, importantly, two putative new genera. Beyond expanding the known genetic diversity of Potyviridae, our analyses revealed a distinct and deeply divergent lineage of potyvirid-like viruses associated with fungi and oomycetes, for which we propose the genus Macrophovirus. These viruses possess compact genomes and atypical genomic organizations, including the absence of canonical plant cell-to-cell movement factors and the presence of HCPro-like proteins arranged in tandem. Comparative structural and phylogenetic analyses indicate that these leader proteases are more closely related to fungal hypoviral counterparts than to canonical potyvirid HCPro factors. Together, our findings substantially expand the host range of Potyviridae, provide compelling evidence that potyvirid-like viruses likely infect fungi and oomycetes in nature, and offer new insights into the evolutionary pathways that have shaped this major virus family.
Encinas, P. A.; O'Boyle, B.; Maksiaev, A.; Nelson, M. I.; Garcia-Sastre, A.; del Real, G.
Influenza A virus (IAV) circulates widely in European pig populations and continues to diversify through frequent introductions from humans, followed by reassortment within swine. Spain represents a particularly dynamic ecological setting due to the coexistence of intensive white-pig production, extensive Iberian-pig systems, and abundant wild boar populations. This study provides an integrated analysis of IAV evolution and genomic diversity in swine in Spain between 2019 and 2022, expanding on previous surveillance from 2016 to 2019. Sampling across 24 provinces yielded 66 new whole-genome sequences from Iberian and white pigs. We identified 18 genotypes, including 11 novel reassortants not detected in our previous survey. Several genotypes, such as H1huN2 G21 and G22, H3N2 G23, and the unusual H3N1 G12, were exclusive to the country. Some genotypes were detected across white pigs, Iberian pigs, and wild boar in Toledo and Badajoz, suggesting viral flow among swine populations. Phylogenetic analyses revealed ongoing introductions of H1N1pdm09 from humans into pigs, generating at least five reassortant genotypes (G10, G16-G19). These lineages incorporated pandemic internal cassettes and, in some cases, human-seasonal N2 segments, highlighting the continued role of humans as a source of viral incursions. Conversely, four zoonotic infections (H1N1v) detected in Spain between 2022 and 2026 were linked to genotypes circulating in white pigs, underscoring the bidirectional nature of IAV transmission at the human-swine interface. Overall, this study demonstrates that Spain provides ecological conditions conducive to IAV diversification, reassortment, and zoonotic risk. The findings reinforce the need for sustained One Health surveillance.
Highlights:
- Novel swine influenza virus (SIV) genotypes exclusive to Spain
- Phylogenetic analysis of genomic segments of zoonotic variants of swine origin detected in Spain since 2022
- Shared circulation of influenza A compatible with interbreed transmission among domestic pigs and wild boar
Doherty, R.; Lewandowski, K.; Fenwick, A.; Everall, I.; Morley, D.; Hartman, H.; Staplehurst, S.; Kent, C.; Loman, N. J.; Quick, J.; Pullan, S. T.
As part of preparedness activities supporting pathogens classified under the UK High Consequence Infectious Diseases (HCID) framework, we previously evaluated both a whole-genome tiling amplicon sequencing scheme and a pan-viral hybridisation capture approach (TWIST-CVRP) for sequencing Andes virus (ANDV). In light of the recent outbreak, we make available viral sequencing datasets generated using a historical ANDV isolate (Chile, 1997). In addition, we provide an evaluation of tiling amplicon scheme performance and present recommended primer updates informed by in silico comparison with the recently released outbreak genome. These datasets are intended to support benchmarking, validation, and optimisation of bioinformatic pipelines across the community.
Miotti, N.; Bono, F.; Ratti, C.; Casati, P.; Turina, M.; Ciuffo, M.
Tomato fruit blotch virus (ToFBV) is an emerging multipartite positive-sense RNA virus associated with blotchy symptoms on tomato fruits and classified within the genus Blunervirus (family Kitaviridae). Despite its increasing agricultural relevance, the study of ToFBV has been hindered by the lack of mechanical transmissibility and the difficulty in reproducing infections under controlled conditions. In this work, we report a preliminary step toward the development of the first infectious agroclone system for ToFBV, based on full-length cDNA copies of its four genomic RNAs. We demonstrate that the cloned viral genome is capable of initiating cell-autonomous replication in Nicotiana benthamiana, as indicated by the accumulation of negative-sense RNA intermediates in infiltrated tissues. To further validate the system, RNA3 was engineered to express GFP, enabling visualization of infection foci and confirming active viral replication in both N. benthamiana and tomato. Functional assays of RNA4-encoded proteins demonstrated that RNA4 encodes a movement protein capable of complementing movement-deficient viral vectors and a putative suppressor of post-transcriptional gene silencing (PTGS). Together, these results establish a versatile reverse genetics platform for ToFBV, providing new insights into the replication and functional organization of blunerviruses and enabling future studies on virus-host interactions, pathogenicity, and control strategies.
Nugier, Q.; Bouras, G.; Galiez, C.; Petit, M.-A.; Enault, F.
Viruses are abundant, ancestral and potentially fast-evolving biological entities. As a result, their encoded proteins are diverse and identifying homologous relationships between sequences is as important for phylogeny and functional annotation as it is challenging. Traditional methods group viral proteins by sequence similarity, build HMM profiles for each protein family, and cluster further via profile comparisons. Here, we present an improved framework where HMM sensitivity is boosted by enriching reference virus HMM profiles with tens of millions of metagenomic sequences. This increases diversity within most protein families, raising the diversity index from less than 2 for 92.7% of clusters to a median value of 6. This enrichment of the profiles more than triples the number of homologies detected compared to the raw profiles. First-step clusters are then grouped more effectively using these relationships and further unified via structural predictions and comparisons. The sequence-enrichment strategy excels at linking small proteins, while structures better connect highly structured ones like tail and head proteins. Applied to 1.42 million proteins, our method yields 56,560 families, far fewer than 200,018 (sequence-based) or 135,048 (raw HMM), revealing that prior approaches vastly overestimated viral protein diversity. The strategy of enriching the diversity of sequences of interest with external sequences, combined with the complementary use of structural information, highlights deep evolutionary links, offering a more accurate picture of viral protein evolution.
Lebatteux, D.; Corso, F.; Soudeyns, H.; Boucoiran, I.; Gantt, S.; Banire Diallo, A.
Distinguishing closely related viral strains requires identifying genomic regions where subtle sequence differences carry biological significance. While k-mer-based approaches offer computational efficiency for genome analysis, existing methods lack standardized frameworks for evaluating which k-mers are most informative. Current selection strategies focus primarily on statistical discriminative power without integrating biological relevance. We introduce KmerSignificance Score (KSS), a k-mer prioritization framework combining three components: an information-theoretic method measuring strain-distinguishing capacity, an optimized amino acid substitution matrix (MIYATA EVO) for mutation impact assessment, and protein-level functional importance scoring derived from UniProt annotations. KSS produces standardized scores in the [0, 1] interval, enabling direct cross-dataset comparison. The discriminative component achieved classification performance comparable or superior to all tested alternatives (mean F1 = 0.880 vs. 0.718-0.877 for six established methods) while additionally providing bounded scores with consistent empirical distributions for cross-dataset comparability. MIYATA EVO, optimized via genetic algorithm, improved biophysical property correlations by 28.4% over the original MIYATA matrix. Protein scoring on 17,470 viral proteins showed robust agreement with UniProt annotation scores (Kendall's τ = 0.777) while revealing finer functional distinctions. Literature validation on SARS-CoV-2 (278,738 sequences, 19 variants), HIV-1 (12,223 sequences, 15 subtypes), and human cytomegalovirus (HCMV; 399-646 sequences, 4-8 genotypes) confirmed that high-scoring k-mers consistently map to established variant-defining mutations, subtype-specific polymorphisms, and genotype markers. KSS provides a standardized framework for viral k-mer prioritization with applications in variant surveillance, molecular epidemiology, and functional annotation.
The tool is available at https://github.com/bioinfoUQAM/KmerSignificanceScore. Author summary: Identifying genetic differences between closely related viral strains is essential for pandemic preparedness, vaccine development, and understanding disease outbreaks. With millions of viral genomes now sequenced, researchers need tools that can rapidly pinpoint which genomic differences matter most biologically, not just which are statistically distinctive. Current k-mer-based approaches identify patterns that distinguish viral strains but cannot assess whether those differences affect protein function or disease phenotype. We developed KmerSignificance Score (KSS), a framework that we designed to rank short genomic sequences by combining three types of evidence: how well they distinguish viral strains, how much the encoded amino acid changes affect protein properties, and how functionally important the affected protein is. We standardized the resulting scores on a 0-to-1 scale, allowing direct comparison across different viruses and studies. We validated our framework on three major human pathogens (SARS-CoV-2, HIV-1, and human cytomegalovirus) and found that top-scoring positions consistently correspond to sites with documented roles in immune evasion, drug resistance, viral fitness, and strain classification. Our framework can help prioritize genomic features for surveillance of emerging variants, guide experimental validation, and support molecular epidemiology.
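To make the idea of a bounded, information-theoretic discriminative score concrete, the sketch below scores a k-mer by the mutual information between its presence and the strain label, normalised by the label entropy so that the result falls in [0, 1]. This is an assumed stand-in and not the KSS formula, which the abstract does not specify. The function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def kmer_presence_score(sequences, labels, kmer):
    """Normalised mutual information between k-mer presence and strain label.

    Returns a value in [0, 1]: 1 when presence of the k-mer fully determines
    the label, 0 when presence carries no information about the label.
    """
    h_label = entropy(Counter(labels).values())
    if h_label == 0:
        return 0.0  # a single label cannot be discriminated
    presence = [kmer in seq for seq in sequences]
    n = len(sequences)
    h_conditional = 0.0
    for flag in (True, False):
        group = Counter(lab for lab, pres in zip(labels, presence) if pres == flag)
        size = sum(group.values())
        if size:
            h_conditional += size / n * entropy(group.values())
    return (h_label - h_conditional) / h_label


# Hypothetical toy data: two strains, two sequences each.
sequences = ["ACGTAC", "ACGTTT", "GGGTAC", "GGGTTT"]
labels = ["x", "x", "y", "y"]
for kmer in ("ACG", "TAC", "GT"):
    print(kmer, kmer_presence_score(sequences, labels, kmer))
```

Normalising by the label entropy is one simple way to obtain scores that are comparable across datasets with different numbers of strains, which is the property the abstract emphasises.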
Krasilnikova, L. A.; Bouton, L.; Brock-Fisher, T. M.; Decker, E.; Godec, M.; Thompson, Z.; Dart, E.; Gao, F.; Gladden-Young, A.; Messer, K. S.; Norville, J.; Specht, I.; Osinski, A.; Li, J.; Lones, C.; DeRuff, K. C.; Siddle, K. J.; Church, D.; Benton, C.; Hansen, K.; Bowen, H.; Bhattacharyya, S.; Epie, N.; Brown, C. M.; Madoff, L. C.; MacInnis, B. L.; Gallagher, G. R.; Smole, S.; Bean, C.; Talbot, E. A.; Burns, M.; Doucette, M.; Fortes, E.; Park, D. J.; Sabeti, P. C.; Wohl, S.
Despite the existence of an effective vaccine, the United States continues to experience outbreaks of hepatitis A, including in Massachusetts (MA) and New Hampshire (NH) in 2018 and again in MA in 2023. To clarify the relationship between these outbreaks and better understand their drivers, we generated hepatitis A virus whole genome sequences from reported cases and analyzed them using open-source genotyping tools developed and released as part of this study. We found that the 2018 and 2023 outbreaks were caused by distinct viral strains, despite affecting individuals with similar demographic characteristics and reported risk factors. Detailed analysis of genomic and epidemiologic data further resolved transmission patterns within and across outbreaks, showing that experiencing homelessness and prior use of drugs were associated with increased transmission while also revealing transmission between individuals with and without these risk factors, as well as spread across state borders. Together, these findings demonstrate the value of broadly accessible genomic tools for understanding hepatitis A outbreaks and illustrate how whole genome analysis can complement epidemiological investigation by resolving transmission patterns and outbreak drivers that can inform public health interventions.
Merrick, C.; Kegode, I.; Leach, S.; Kale, M.; Heiden, D.; Beckham, J. D.
Flaviviruses such as Zika virus (ZIKV) contain RNA tertiary structures within the 3′ untranslated region (UTR) that halt the 5′-to-3′ RNA exonuclease, Xrn1. Halting of Xrn1 at the two RNA structures, termed exonuclease-resistant RNA 1 and 2 (xrRNA1 and xrRNA2), results in the formation of subgenomic flavivirus RNAs (sfRNA) that support viral pathogenesis. While the role of the flavivirus xrRNA1 in pathogenesis is well characterized, the role of the flavivirus xrRNA2 structure is not well studied. Using xrRNA crystal structure data, we inserted structure-informed mutations in ZIKV xrRNA2 to disrupt tertiary folding independent of significant sequence changes, evaluate sfRNA production, and define pathogenesis in a murine model of ZIKV infection. Compared to our prior work with ZIKV xrRNA1, we found that ZIKV xrRNA2 is under increased selection pressure to maintain sfRNA production, so that multiple targeted mutations in the xrRNA2 junctional region were needed to produce a stable mutant. Using three targeted xrRNA junctional mutations, termed ZIKV X2.L1, we found that the resulting ZIKV clone exhibits attenuated cell death and decreased viral growth in tissue culture. In a murine model of ZIKV infection, mice inoculated with ZIKV X2.L1 exhibit significantly decreased symptomatic infection, improved survival, decreased end-organ infection in the brain, and continued robust neutralizing antibody responses to ZIKV. Despite attenuation, serum from ZIKV X2.L1-infected mice or mice vaccinated with ZIKV X2.L1 exhibited 100% protection from lethal ZIKV challenge. These studies show that RNA structure-informed mutations provide a robust model for flavivirus attenuation and vaccine design. Additional studies defining the mechanisms of robust neutralizing antibody responses and flavivirus-specific vaccine development are needed to continue the development of this novel vaccine platform approach for medically important flavivirus infections.
Author summary: Zika virus is a member of the Orthoflavivirus genus (referred to here as flaviviruses), whose members exhibit conserved RNA structures in the 3′ untranslated region of the viral RNA genome. Two conserved RNA structures, termed exonuclease-resistant RNA 1 and 2, are important to support the ability of the virus to cause disease. While the first RNA structure is well studied, less is known about the role of exonuclease-resistant RNA 2 in flavivirus infection. Using reverse genetics, we made mutations in the Zika virus exonuclease-resistant RNA 2 structure and studied how this mutant Zika virus was weakened, or attenuated. We found that the mutant Zika virus clone exhibits reduced virus replication, reduced ability to kill cells, and decreased virulence in mouse models of Zika virus disease. Using this mutant virus as a potential vaccine candidate, we found that Zika virus with mutations in the exonuclease-resistant RNA 2 structure provides complete protection from lethal Zika virus challenge. These data suggest that targeting the second exonuclease-resistant RNA structure in flaviviruses is a viable platform for the development of vaccine candidates for this important group of viruses.
Nguyen, H.-H.; Rudar, J.; Mubareka, S.; Lapen, D.; Berhane, Y.; Leung, C. K.; Lung, O.
Background: Influenza A virus (IAV) is a major public health burden, causing seasonal epidemics and occasional pandemics. Its transmission from avian species to mammals and subsequent spread requires adaptive changes in the viral genome. Understanding these molecular adaptations is essential for pandemic preparedness, and machine learning offers a powerful approach to uncover the evolution and biology of IAV. Results: Our calibrated WaveSeekerNet model accurately predicted the host source of 8 IAV segments (Macro F1-score: 0.9728), significantly improving the reliability of predicted probabilities, with calibration errors approaching zero. Interpretation showed that avian-adapted IAVs consistently activated G/C content, whereas mammalian-adapted IAVs generally activated A/T content. This distinction was confirmed by codon-level analysis, in which G/C-rich codons were rewarded for the avian hosts and A/T-rich codons for the mammalian hosts. We defined host-adaptive distance to quantify species barriers and proposed it as a risk-assessment metric. We hypothesized a Mammalian Adaptation Zone (MAZ), a zone that a virus is expected to reach by adjusting its host-adaptive distance, thereby helping it establish persistent mammalian lineages. The analysis also revealed the Hard Distance of avian-origin viruses (e.g., H5Nx, H9N2), indicating they have not yet established persistent mammalian lineages. Finally, analysis of human H7N9 (2013, China) and non-human mammalian H5Nx (North America) viruses showed that WaveSeekerNet accurately identified key mammalian-adaptive mutations, including PB2-E627K and PB2-D701N. Conclusions: WaveSeekerNet elucidated IAV host-adaptation mechanisms in silico, informing improved surveillance and intervention strategies.
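The G/C versus A/T distinction reported above can be illustrated with a simple descriptive statistic. The sketch below only computes GC fractions for a sequence and for each codon of an in-frame coding sequence. It is not WaveSeekerNet's interpretation method, and the function names `gc_fraction` and `codon_gc_fractions` are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T/U) that are G or C.

    Ambiguity codes such as N are ignored. Returns 0.0 if the sequence
    contains no unambiguous bases.
    """
    seq = seq.upper().replace("U", "T")
    called = [base for base in seq if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


def codon_gc_fractions(cds):
    """GC fraction of each codon in an in-frame coding sequence.

    Assumes the sequence starts in frame; a trailing incomplete codon
    is dropped.
    """
    usable = len(cds) - len(cds) % 3
    return [gc_fraction(cds[i:i + 3]) for i in range(0, usable, 3)]


# Toy input, invented for this example only.
print(gc_fraction("ACGU"))
print(codon_gc_fractions("GCGATAAC"))
```

Comparing such per-sequence or per-codon values between avian and mammalian isolates would be one straightforward way to check a compositional signal of this kind independently of any model.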